Airbnb, Inc. is an online marketplace for arranging or offering lodging, primarily homestays, or tourism experiences. The company does not own any of the real estate listings, nor does it host events; it acts as a broker, receiving commissions from each booking. The company is based in San Francisco, California, United States.
You are consulting for a real estate company whose business model is purchasing properties in New York City to rent out as short-term rentals. The company has already concluded that two-bedroom properties are the most profitable; however, it does not know which zip codes are the best to invest in.
The company has engaged your firm to build a data product and to provide conclusions that help it understand which zip codes would generate the maximum profit on short-term rentals within New York City.
library(tidyr)
library(DT)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(kableExtra)
library(readxl)
library(scales)
library(RColorBrewer)
library(wesanderson)
library(plotly)
library(data.table)
library(ggpubr)
| Package | Description |
|---|---|
| library(tidyr) | For changing the layout of your data sets, to convert data into the tidy format |
| library(DT) | For HTML display of data |
| library(ggplot2) | For customizable graphical representation |
| library(dplyr) | For data manipulation |
| library(tidyverse) | Collection of R packages designed for data science that works harmoniously with other packages |
| library(kableExtra) | To display tables in a polished way |
| library(readxl) | Makes it easy to get data out of Excel and into R |
| library(scales) | Implements scales in a graphics-system-agnostic way |
| library(RColorBrewer) | Pre-made color palettes for clear, distinguishable visualizations |
| library(wesanderson) | Wes Anderson movie color palettes for R |
| library(plotly) | Interactive, publication-quality graphs |
| library(data.table) | Fast aggregation of large data sets |
| library(ggpubr) | Arranging multiple ggplots in a grid |
Two datasets have been provided for analysing the problem statement: the Zillow two-bedroom home-value data and the AirBnB listings data. Both datasets have been imported for our analysis.
#Reading the Zillow and Listings Data File
home_price <- read.csv("D:/College/Capital One/Data Challenge/Code/Zip_Zhvi_2bedroom.csv")
listings <- read.csv("D:/College/Capital One/Data Challenge/Code/listings.csv")
The following actions have been performed on the Zillow dataset to prepare for analysis:
- Restrict the data to listings in New York City by filtering out all other cities.
- Rename the column RegionName to zipcode for clarity.
- Apply an ARIMA model to forecast listing prices from the available data.
- Create a new column currentPrice to hold the forecast current price.

The data available to us contains prices only up to the year 2017, so our analysis is more accurate if we forecast the listing prices forward to the present. Assuming that the price series has seasonality, and that values depend not only on previous values (the autoregressive, AR, component) but also on differences between previous values (the moving-average, MA, component), we apply an ARIMA model to predict the cost of the properties.
zillow_forecast <- function(tempdf, cityName) { # Zillow data and city name given as arguments
  # Keep the identifying columns and the last five years (61 months) of prices
  n <- 60
  tempdf <- tempdf[, c(2, 3, 7, (ncol(tempdf) - n):ncol(tempdf))]
  tempdf <- filter(tempdf, City == cityName) # Keep only the required city
  colnames(tempdf)[colnames(tempdf) == "RegionName"] <- "zipcode" # Rename for merging later
  tempdf$currentPrice <- NA_real_ # Column to hold the latest forecast price
  # Iterate over each zipcode to forecast the latest property price
  for (i in 1:nrow(tempdf)) {
    # Convert the monthly price columns into a time series
    tmp <- ts(as.vector(t(tempdf[, c(4:64)])[, i]), start = c(2012, 6), frequency = 12)
    # Fit a seasonal ARIMA model by maximum likelihood
    ARIMAfit <- arima(tmp, order = c(1, 1, 1),
                      seasonal = list(order = c(1, 0, 1), period = NA),
                      method = "ML")
    pred <- predict(ARIMAfit, n.ahead = 20) # Forecast 20 months ahead
    predval <- pred$pred
    tempdf$currentPrice[i] <- predval[length(predval)] # Use the last forecast value as the current price
  }
  return(tempdf[, c(1, 2, 3, 65)]) # zipcode, City, SizeRank, currentPrice
}
zillow_final <- zillow_forecast(home_price,"New York")
The data frame zillow_final is ready for our analysis.
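For clarity, the forecasting pattern inside zillow_forecast() can be illustrated in isolation. The sketch below uses a synthetic monthly price series (the values are made up) and omits the seasonal terms for brevity; the ts / arima / predict flow is the same as above.

```r
# Synthetic monthly price series mimicking 61 months of Zillow data
set.seed(1)
prices <- ts(5e5 + cumsum(rnorm(61, mean = 1000, sd = 500)),
             start = c(2012, 6), frequency = 12)

# Fit an ARIMA(1,1,1) model by maximum likelihood (seasonal terms omitted here)
fit  <- arima(prices, order = c(1, 1, 1), method = "ML")

# Forecast 20 months ahead; the last forecast value plays the role of currentPrice
pred   <- predict(fit, n.ahead = 20)
latest <- as.numeric(pred$pred[length(pred$pred)])
length(pred$pred)   # 20 forecast values
```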
The listings file has 48,895 observations and 106 variables. For our analysis, however, we use only the columns that are informative and relevant for identifying the most profitable zipcodes to invest in. The following variables are kept and the others are dropped.
relevantcol <- c("id","zipcode","bedrooms","price","weekly_price","monthly_price","cleaning_fee","neighbourhood_group_cleansed","number_of_reviews","review_scores_rating")
listings_fil <- listings[,relevantcol]
Our analysis is restricted to two-bedroom properties, so we keep only the listings with exactly 2 bedrooms.
listings_fil <- listings_fil %>% filter(bedrooms == 2)
The listings_fil is now ready for our analysis.
We then merge the two datasets and perform our manipulation and analysis on the merged dataset.
finalData <- merge(listings_fil,zillow_final,by = c("zipcode")) # merge data sets on zipcode
str(finalData)
## 'data.frame': 1565 obs. of 13 variables:
## $ zipcode : Factor w/ 200 levels "","07093","07302",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ id : int 19288937 13866112 31835260 5207973 17843998 4126452 35415489 2150328 31408487 3984168 ...
## $ bedrooms : int 2 2 2 2 2 2 2 2 2 2 ...
## $ price : Factor w/ 674 levels "$0.00","$1,000.00",..: 244 129 208 222 244 189 244 635 640 134 ...
## $ weekly_price : Factor w/ 599 levels "","$1,000.00",..: 1 1 1 49 1 1 1 438 1 1 ...
## $ monthly_price : Factor w/ 681 levels "","$1,000.00",..: 1 1 1 1 1 1 1 1 1 466 ...
## $ cleaning_fee : Factor w/ 207 levels "","$0.00","$1,000.00",..: 13 193 22 163 182 26 26 5 5 44 ...
## $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ number_of_reviews : int 99 3 6 1 8 13 4 0 18 11 ...
## $ review_scores_rating : int 97 100 93 NA 88 80 100 NA 93 91 ...
## $ City : Factor w/ 4684 levels "Aberdeen","Abilene",..: 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 ...
## $ SizeRank : int 21 21 21 21 21 21 21 21 21 21 ...
## $ currentPrice : num 2171496 2171496 2171496 2171496 2171496 ...
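Note that merge() performs an inner join by default, so listings in zipcodes without a Zillow price estimate are silently dropped. A toy sketch (hypothetical zipcodes and prices, not from the dataset) makes this visible:

```r
library(dplyr)

# Toy frames: one listing zipcode ("10002") has no Zillow estimate
toy_listings <- data.frame(zipcode = c("10001", "10002", "10003"),
                           price   = c(150, 200, 175))
toy_zillow   <- data.frame(zipcode = c("10001", "10003"),
                           current_price = c(1.2e6, 9.5e5))

merged  <- merge(toy_listings, toy_zillow, by = "zipcode") # inner join: 2 rows survive
dropped <- anti_join(toy_listings, toy_zillow, by = "zipcode") # the unmatched zipcode

nrow(merged)     # 2
dropped$zipcode  # "10002"
```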
The variable names in the final merged dataset are not self-explanatory, so we rename the columns to make them clearer.
colnames(finalData) <- c("zipcode","id","bedrooms","per_night_price","weekly_price","monthly_price","cleaning_fee","neighbourhood","number_of_reviews","review_scores_rating","city","size_rank","current_price")
The City column still carries all the factor levels inherited from the Zillow data even though only New York remains, so we reset the factor levels of the column.
finalData$city <- factor(finalData$city, levels = c("New York"))
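An equivalent, arguably safer alternative is droplevels(), which removes unused levels without risking NAs if an unexpected value slipped through. A minimal sketch with made-up levels:

```r
# Factor with stale levels, as after merging with a larger table
f <- factor(c("New York", "New York"),
            levels = c("Aberdeen", "Abilene", "New York"))

nlevels(f)              # 3 levels, only one of which is used
nlevels(droplevels(f))  # 1 level remains after dropping unused ones
```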
All values in the price columns (per_night_price, weekly_price, monthly_price, cleaning_fee) are stored as text with a dollar sign, which would hinder analysis of those columns. We therefore strip the dollar signs and commas and convert the values to numeric, after which the merged dataset is ready for analysis.
cols <- c("per_night_price", "weekly_price", "monthly_price", "cleaning_fee") # columns to be cleaned
# Helper: strip "$" and "," and convert the result to numeric
replace_dollar <- function(x){
  price <- as.numeric(gsub("[$,]", "", x))
  return(price)
}
# Apply the helper to each price column
finalData[cols] <- lapply(finalData[cols], replace_dollar)
# Final Check of the structure of the data
str(finalData)
## 'data.frame': 1565 obs. of 13 variables:
## $ zipcode : Factor w/ 200 levels "","07093","07302",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ id : int 19288937 13866112 31835260 5207973 17843998 4126452 35415489 2150328 31408487 3984168 ...
## $ bedrooms : int 2 2 2 2 2 2 2 2 2 2 ...
## $ per_night_price : num 250 165 218 230 250 200 250 850 88 170 ...
## $ weekly_price : num NA NA NA 1200 NA NA NA 6000 NA NA ...
## $ monthly_price : num NA NA NA NA NA NA NA NA NA 4500 ...
## $ cleaning_fee : num 110 85 120 60 75 125 125 100 100 150 ...
## $ neighbourhood : Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ number_of_reviews : int 99 3 6 1 8 13 4 0 18 11 ...
## $ review_scores_rating: int 97 100 93 NA 88 80 100 NA 93 91 ...
## $ city : Factor w/ 1 level "New York": 1 1 1 1 1 1 1 1 1 1 ...
## $ size_rank : int 21 21 21 21 21 21 21 21 21 21 ...
## $ current_price : num 2171496 2171496 2171496 2171496 2171496 ...
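A quick sanity check of the cleaning helper on toy values (not taken from the dataset) confirms the conversion behaves as expected, including for the empty strings that appear in the price columns:

```r
# Same helper as defined above
replace_dollar <- function(x) as.numeric(gsub("[$,]", "", x))

replace_dollar("$1,250.00")                      # 1250
suppressWarnings(replace_dollar(""))             # NA: empty strings become NA
suppressWarnings(replace_dollar(c("$85.00", ""))) # 85 NA
```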
The final cleaned dataset can be found below in an interactive table.
datatable(finalData, filter = 'top')
| Variable | Class | Description |
|---|---|---|
| zipcode | Factor | Zip code where the property is located. |
| id | int | Identifier used by AirBnB for the listing |
| bedrooms | int | Indicates the number of bedrooms within the property. |
| per_night_price | num | Price the host is charging to stay per night. |
| weekly_price | num | Weekly price the host is charging; this could be discounted for longer term stays. |
| monthly_price | num | Monthly price the host is charging; this could be discounted for longer term stays. |
| cleaning_fee | num | Price the host is charging to clean up after your stay. |
| neighbourhood | Factor | Neighborhood where the property is located |
| number_of_reviews | int | Number of reviews received for the property for its entire existence within AirBnB. |
| review_scores_rating | int | Overall score given based on accuracy, cleanliness, check-in, communication, location, and value. |
| city | Factor | City where the property is located. |
| size_rank | int | Rank of the region by population; the lower the number, the greater the population. |
| current_price | num | Forecast current market price of the property (from the ARIMA model). |
We first analyze the number of listings available in each zipcode. This gives us an idea of which zipcodes have the most AirBnB listings.
finalDataWithCount <- finalData %>%
group_by(zipcode) %>%
summarise(count = n())
ggplot(finalDataWithCount, aes(x = reorder(zipcode,-count), y = count)) +
theme_set(theme_light()) +
geom_bar(stat = "identity", width = .5, fill = "tomato3") +
ggtitle("Number of 2 Bedroom Listings in NYC in each zipcode") +
xlab("Zipcode") +
ylab("Number of 2 Bedroom Listings") +
theme(axis.text.x = element_text(angle = 65, vjust = 0.8))
From the graph above, we conclude that zipcode 11215 has the most 2 bedroom listings in New York City. The top 10 zipcodes by number of 2 bedroom listings are:
We now analyze which neighbourhoods in New York City have the maximum number of 2 bedroom listings.
Neighbourhood_Analysis <- finalData %>%
  group_by(neighbourhood) %>% # group by the column directly, not finalData$neighbourhood
  summarise(count = n())
ggplot(Neighbourhood_Analysis, aes(x = reorder(neighbourhood, count), y = count)) +
theme_light() +
geom_bar(stat = "identity", width = .5, fill = "#1F618D") +
coord_flip() +
geom_text(aes(label = count), hjust = 0.0, color = "blue", size = 3.5) +
ggtitle("Number of 2 Bedroom Listings in each Neighbourhood") +
xlab("Neighbourhood") + ylab("Number of 2 Bedroom Listings") +
theme(axis.text.x = element_text(angle = 65, vjust = 0.8))
From the graph above, we conclude the following results:
The number of reviews can be used as a proxy for the popularity of AirBnB listings in each zipcode of New York City: the more reviews, the more popular the listings in that zipcode. We therefore analyze which zipcodes have the most reviews.
finalData %>% group_by(zipcode) %>% summarise(reviews_count = sum(number_of_reviews)) %>%
ggplot( aes(x = zipcode, y = reviews_count)) +
geom_segment( aes(x = reorder(zipcode,reviews_count), xend = zipcode, y = 0, yend = reviews_count), color = "blue") +
geom_point( color = "orange", size = 4 ) +
theme_classic() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
) +
xlab(" Zipcode ") +
ylab("Number of reviews")
From the graph, we can conclude that the zipcodes with the most reviews are:
We then visualize the number of reviews grouped by Neighbourhood and also compute the average per night price of the AirBnB listings in those neighbourhoods.
reviews_data <- finalData %>% group_by(neighbourhood, zipcode, size_rank) %>% summarise(reviews_count = sum(number_of_reviews), avg_night_price = round(mean(per_night_price), digits = 2), avg_rating = round(mean(review_scores_rating, na.rm = TRUE), digits = 2))
review_price_plot <- reviews_data %>% arrange(desc(size_rank)) %>%
ggplot(aes(x = reviews_count, y = avg_night_price, size = avg_rating, color = neighbourhood)) +
geom_point(alpha = 0.5) +
labs(y = "Average Price Per Night in that Zipcode", x = "Total Number of Reviews") +
ggtitle("Number of Reviews and Price for Zipcodes in Each Neighbourhood") +
scale_size(range = c(0, 10), name = "Size Rank(M)")
ggplotly(review_price_plot)
The size of the bubble represents the mean rating of listings in that zipcode of the given neighbourhood: the bigger the bubble, the higher the mean user rating given to the AirBnB listings there. From the graph, we can conclude that:
In order to identify the most profitable zipcodes to invest in, we first need to analyse the cost of the investment required to establish an AirBnB listing in each zipcode of each neighbourhood.
medianCost <- finalData %>%
  select(zipcode, neighbourhood, current_price) %>%
  filter(neighbourhood %in% c("Manhattan", "Brooklyn", "Queens", "Staten Island")) %>% # %in%, not ==, to match any of the four
  group_by(neighbourhood, zipcode) %>%
  summarise_all(median) %>%
  ggplot(aes(x = reorder(zipcode, -current_price), y = current_price, fill = neighbourhood)) +
  ggtitle("Median Cost of Listings in various Zipcodes") +
  geom_bar(stat = "identity") + scale_y_continuous(labels = scales::comma) +
  labs(y = "Median Cost", x = "Zipcode") +
  theme_classic2() +
  theme(plot.background = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), panel.border = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  guides(fill = guide_legend(title = "Neighbourhood"))
ggplotly(medianCost)
From the graph above, we can identify the zipcodes with the highest and lowest median property cost in each neighbourhood.
The next stage of the analysis involves calculating the revenue earned per year in each zipcode of New York City. As given in the problem statement, we have assumed an occupancy rate of 75% (0.75) for the listings.
revenue_analysis <- finalData
occupancy_rate <- 0.75
Year_days <- 365 # number of days in a year
# Generate the revenue for a year per property
revenue_analysis$revenue_per_year <- occupancy_rate*Year_days*revenue_analysis$per_night_price
# Obtain the Revenue/Cost ratio
revenue_analysis$revenue_cost_ratio <- ((revenue_analysis$revenue_per_year)/(revenue_analysis$current_price))
# Breakeven period in years (cleaning fee assumed to be incurred weekly, scaled by occupancy)
revenue_analysis$breakeven_point <- (revenue_analysis$current_price/(revenue_analysis$revenue_per_year - (occupancy_rate*52*revenue_analysis$cleaning_fee)))
# Revenue earned over 5 years
revenue_analysis$year_5 <- ((revenue_analysis$revenue_per_year*5) - (5*occupancy_rate*52*revenue_analysis$cleaning_fee)) - revenue_analysis$current_price
# Revenue earned over 10 years
revenue_analysis$year_10 <- ((revenue_analysis$revenue_per_year*10) - (10*occupancy_rate*52*revenue_analysis$cleaning_fee)) - revenue_analysis$current_price
# Revenue earned over 15 years
revenue_analysis$year_15 <- ((revenue_analysis$revenue_per_year*15) - (15*occupancy_rate*52*revenue_analysis$cleaning_fee)) - revenue_analysis$current_price
# Revenue earned over 20 years
revenue_analysis$year_20 <- ((revenue_analysis$revenue_per_year*20) - (20*occupancy_rate*52*revenue_analysis$cleaning_fee)) - revenue_analysis$current_price
# Revenue earned over 25 years
revenue_analysis$year_25 <- ((revenue_analysis$revenue_per_year*25) - (25*occupancy_rate*52*revenue_analysis$cleaning_fee)) - revenue_analysis$current_price
# Revenue earned over 30 years
revenue_analysis$year_30 <- ((revenue_analysis$revenue_per_year*30) - (30*occupancy_rate*52*revenue_analysis$cleaning_fee)) - revenue_analysis$current_price
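To make the formulas above concrete, here is the arithmetic for a single hypothetical listing (the $200 nightly price, $100 cleaning fee, and $1M purchase price are made-up illustration values):

```r
occupancy_rate  <- 0.75
per_night_price <- 200   # hypothetical listing
cleaning_fee    <- 100
current_price   <- 1e6

revenue_per_year  <- occupancy_rate * 365 * per_night_price  # 54,750 per year
cleaning_per_year <- occupancy_rate * 52  * cleaning_fee     # 3,900 per year
breakeven_years   <- current_price / (revenue_per_year - cleaning_per_year)
round(breakeven_years, 1)   # 19.7 years to recover the purchase price
```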
revenue_plot <- revenue_analysis %>%
  select(zipcode, neighbourhood, revenue_per_year) %>%
  filter(neighbourhood %in% c("Manhattan", "Brooklyn", "Queens", "Staten Island")) %>% # %in%, not ==, to match any of the four
  group_by(neighbourhood, zipcode) %>%
  summarise_all(median) %>%
  ggplot(aes(x = reorder(zipcode, -revenue_per_year), y = revenue_per_year, fill = neighbourhood)) +
  ggtitle("Median Revenue from Listings in various Zipcodes") +
  geom_bar(stat = "identity") + scale_y_continuous(labels = scales::comma) +
  labs(y = "Median Revenue", x = "Zipcode") +
  theme_classic() +
  theme(plot.background = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), panel.border = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  guides(fill = guide_legend(title = "Neighbourhood"))
ggplotly(revenue_plot)
From the graph above, we can conclude that the ten zipcodes with the highest median revenue per year are:
We now analyse the profit earned in each zipcode of New York City over a period of 30 years. This helps us determine which zipcodes start earning a profit the earliest and which earn the maximum profit over the tenure.
For the analysis, the time span is divided into tenures of 5, 10, 15, 20, 25 and 30 years, and the profit earned over each tenure is analysed.
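The year_5 through year_30 columns computed earlier all follow the same formula, which can be written once as a function of the tenure. A sketch, using the same 75% occupancy assumption and made-up listing values for illustration:

```r
# Profit over n_years of operation, net of the purchase price
profit_after <- function(n_years, per_night_price, cleaning_fee, current_price,
                         occupancy_rate = 0.75) {
  revenue  <- occupancy_rate * 365 * per_night_price * n_years  # nightly revenue over the tenure
  cleaning <- occupancy_rate * 52  * cleaning_fee   * n_years   # weekly cleaning cost over the tenure
  revenue - cleaning - current_price
}

# Hypothetical listing: $200/night, $100 cleaning fee, $1M purchase price
profit_after(10, 200, 100, 1e6)   # -491,500: still under water after 10 years
```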
revenue_cost_ratio <- revenue_analysis %>%
group_by(zipcode, current_price) %>%
summarise( mean_revenue_per_year = round(mean(revenue_per_year, na.rm = TRUE), digits = 2), mean_revenue_5_year = round(mean(year_5, na.rm = TRUE), digits = 2), mean_revenue_10_year = round(mean(year_10, na.rm = TRUE), digits = 2), mean_revenue_15_year = round(mean(year_15, na.rm = TRUE), digits = 2), mean_revenue_20_year = round(mean(year_20, na.rm = TRUE), digits = 2), mean_revenue_25_year = round(mean(year_25, na.rm = TRUE), digits = 2), mean_revenue_30_year = round(mean(year_30, na.rm = TRUE), digits = 2))
revenue_5_years <- ggplot(revenue_cost_ratio, aes(y = mean_revenue_5_year , x = zipcode)) +
coord_flip() +
labs(y = "Profit in 5 years", x = "Zipcode") +
geom_bar(aes(fill = mean_revenue_5_year < 0), stat = "identity") + scale_fill_manual(guide = FALSE, breaks = c(TRUE, FALSE), values = c("red", "green"))
revenue_10_years <- ggplot(revenue_cost_ratio, aes(y = mean_revenue_10_year , x = zipcode)) +
coord_flip() +
labs(y = "Profit in 10 years", x = "Zipcode") +
geom_bar(aes(fill = mean_revenue_10_year < 0), stat = "identity") + scale_fill_manual(guide = FALSE, breaks = c(TRUE, FALSE), values = c("red", "green"))
revenue_15_years <- ggplot(revenue_cost_ratio, aes(y = mean_revenue_15_year , x = zipcode)) +
coord_flip() +
labs(y = "Profit in 15 years", x = "Zipcode") +
geom_bar(aes(fill = mean_revenue_15_year < 0), stat = "identity") + scale_fill_manual(guide = FALSE, breaks = c(TRUE, FALSE), values = c("red", "green"))
revenue_20_years <- ggplot(revenue_cost_ratio, aes(y = mean_revenue_20_year , x = zipcode)) +
coord_flip() +
labs(y = "Profit in 20 years", x = "Zipcode") +
geom_bar(aes(fill = mean_revenue_20_year < 0), stat = "identity") + scale_fill_manual(guide = FALSE, breaks = c(TRUE, FALSE), values = c("red", "green"))
revenue_25_years <- ggplot(revenue_cost_ratio, aes(y = mean_revenue_25_year , x = zipcode)) +
coord_flip() +
labs(y = "Profit in 25 years", x = "Zipcode") +
geom_bar(aes(fill = mean_revenue_25_year < 0), stat = "identity") + scale_fill_manual(guide = FALSE, breaks = c(TRUE, FALSE), values = c("red", "green"))
revenue_30_years <- ggplot(revenue_cost_ratio, aes(y = mean_revenue_30_year , x = zipcode)) +
coord_flip() +
labs(y = "Profit in 30 years", x = "Zipcode") +
geom_bar(aes(fill = mean_revenue_30_year < 0), stat = "identity") + scale_fill_manual(guide = FALSE, breaks = c(TRUE, FALSE), values = c("red", "green"))
ggarrange(revenue_5_years, revenue_10_years, revenue_15_years, revenue_20_years,revenue_25_years, revenue_30_years, ncol = 3, nrow = 2)
In the graphs above, red indicates a loss and green indicates a profit.
From the above graph, we can conclude the following results:
From the graphs and results obtained, the top 5 zipcodes which are the best choice for investment are:
These 5 zipcodes are therefore the most profitable to invest in. Choosing them also gives the company good coverage of New York City, as together they span all the neighbourhoods present in the dataset.
- Use the Amenities column for better imputation of the missing square-footage values. Additionally, use it to see how price is affected by the number of amenities; that effect should be used to scale rent across properties with the same number of amenities.
- Analyse the importance of the role hosts play in the popularity of listings across the various neighbourhoods.
- Use text mining to analyse free-text columns such as Description and see whether popular AirBnB listings share any common characteristics.